White Wine Data Analysis by John Ngo

In this report we explore the chemical properties of white wines and how each one relates to the quality rating.

Univariate Plots Section

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

Individual Plots

Our dataset contains 4898 observations and 13 variables. To show several similar histograms in one plot, we melt the data into long format (one row per variable-value pair) and use facets to draw one histogram per variable.
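The melt-and-facet approach can be sketched as follows. This is a minimal illustration rather than the code behind the report: it assumes the `reshape2` and `ggplot2` packages are installed, and the data frame name `wine` is hypothetical. The six rows are copied from the first values shown in the `str()` output above.

```r
# Sketch only: assumes reshape2 and ggplot2 are installed.
library(reshape2)  # provides melt()
library(ggplot2)   # provides ggplot(), geom_histogram(), facet_wrap()

# Hypothetical stand-in for the full dataset: the first six rows of three
# columns, as printed by str() above.
wine <- data.frame(
  X = 1:6,
  fixed.acidity = c(7.0, 6.3, 8.1, 7.2, 7.2, 8.1),
  volatile.acidity = c(0.27, 0.30, 0.28, 0.23, 0.23, 0.28),
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1)
)

# melt() stacks the measurement columns into (variable, value) pairs,
# keeping the row ID so each value can be traced back to its wine.
wine_long <- melt(wine, id.vars = "X")

# One histogram panel per variable; the x scales are free because the
# variables are measured in different units.
p <- ggplot(wine_long, aes(x = value)) +
  geom_histogram(bins = 10) +
  facet_wrap(~ variable, scales = "free_x")
print(p)
```

With the full dataset, the same two steps produce one panel for each of the chemical properties without repeating the plotting code.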

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

The median fixed acidity is 6.8 g/l, and most white wines have a fixed acidity between 5.5 and 8.5 g/l.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

The distribution of volatile acidity is slightly right skewed with a median of 0.26 g/l. There are some outliers on the higher end of the scale.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

The distribution of citric acid is approximately normal, with a median of 0.32 g/l and a mean of 0.334 g/l.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

Residual sugar has a median of 5.2 g/l, and the distribution is right skewed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

Chlorides have a median of 0.043 g/l. The bulk of the distribution is roughly normal, with a long right tail reaching 0.346 g/l.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

Free sulfur dioxide is roughly normally distributed with a slight right skew. The median value is 34 mg/l.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

Total sulfur dioxide is roughly normally distributed with a slight right skew. The median value is 134 mg/l.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

The density of our white wines falls in a very narrow range: the middle 50% lie between 0.9917 and 0.9961 g/cm³. The median value is 0.9937 g/cm³.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

Most pH values fall between 2.9 and 3.5, and the distribution is approximately normal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

Sulphates are slightly right skewed with a median of 0.47 g/l.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

Alcohol content is right skewed. It starts at 8%, perhaps reflecting a minimum level of alcohol required for a wine.

Univariate Analysis

What is the structure of your dataset?

The dataset has 4898 observations of 13 variables, with one row per wine. Eleven variables measure a chemical property, one (quality) records the overall taste rating, and one (X) is a unique observation ID.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is the quality rating.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Examining each of the supporting chemical variables against quality should show which ones matter most. We expect some variables to correlate more strongly with quality than others.

Did you create any new variables from existing variables in the dataset?

No new variables were created in the dataset.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

There were no unusual distributions and no missing values, so no adjustments to the data were needed. The dataset was already clean, which makes it a good dataset to analyze.

Bivariate Plots Section

## [1] "median of fixed.acidity by quality:"
## dataset$quality: 3
## [1] 7.3
## -------------------------------------------------------- 
## dataset$quality: 4
## [1] 6.9
## -------------------------------------------------------- 
## dataset$quality: 5
## [1] 6.8
## -------------------------------------------------------- 
## dataset$quality: 6
## [1] 6.8
## -------------------------------------------------------- 
## dataset$quality: 7
## [1] 6.7
## -------------------------------------------------------- 
## dataset$quality: 8
## [1] 6.8
## -------------------------------------------------------- 
## dataset$quality: 9
## [1] 7.1

Median fixed acidity is relatively stable across the quality levels. We also see a wide dispersion of acidity values within each quality level, which suggests that other variables contribute to the overall quality.
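The per-quality medians printed above have the shape of base R's `by()` output. A minimal sketch of that kind of grouped summary follows. The data frame `wine` and its values are hypothetical and chosen only for illustration, and the report's actual code is not shown in the source.

```r
# Hypothetical mini dataset; the values are illustrative, not taken from the data.
wine <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7),
  fixed.acidity = c(6.8, 7.0, 6.6, 6.8, 6.4, 7.2, 6.7, 6.5)
)

print("median of fixed.acidity by quality:")

# by() applies median() to the fixed.acidity values within each quality level
# and prints one labelled block per level.
med <- by(wine$fixed.acidity, wine$quality, median)
print(med)

# tapply() computes the same medians but returns them in a compact named form,
# which is easier to reuse in later calculations.
med_vec <- tapply(wine$fixed.acidity, wine$quality, median)
print(med_vec)
```

Swapping `fixed.acidity` for any other column repeats the summary for that variable.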

## [1] "median of volatile.acidity by quality:"
## dataset$quality: 3
## [1] 0.26
## -------------------------------------------------------- 
## dataset$quality: 4
## [1] 0.32
## -------------------------------------------------------- 
## dataset$quality: 5
## [1] 0.28
## -------------------------------------------------------- 
## dataset$quality: 6
## [1] 0.25
## -------------------------------------------------------- 
## dataset$quality: 7
## [1] 0.25
## -------------------------------------------------------- 
## dataset$quality: 8
## [1] 0.26
## -------------------------------------------------------- 
## dataset$quality: 9
## [1] 0.27

The median level of volatile acidity is fairly stable across the quality levels, although it dips slightly in the middle and upper ratings (0.25 g/l at quality 6 and 7, compared with 0.32 g/l at quality 4).

## [1] "median of citric.acid by quality:"
## dataset$quality: 3
## [1] 0.345
## -------------------------------------------------------- 
## dataset$quality: 4
## [1] 0.29
## -------------------------------------------------------- 
## dataset$quality: 5
## [1] 0.32
## -------------------------------------------------------- 
## dataset$quality: 6
## [1] 0.32
## -------------------------------------------------------- 
## dataset$quality: 7
## [1] 0.31
## -------------------------------------------------------- 
## dataset$quality: 8
## [1] 0.32
## -------------------------------------------------------- 
## dataset$quality: 9
## [1] 0.36

Median citric acid stays between 0.29 and 0.36 g/l across the quality levels. The highest median is at quality 9, but any increase in quality with citric acid is slight.

## [1] "median of residual.sugar by quality:"
## dataset$quality: 3
## [1] 4.6
## -------------------------------------------------------- 
## dataset$quality: 4
## [1] 2.5
## -------------------------------------------------------- 
## dataset$quality: 5
## [1] 7
## -------------------------------------------------------- 
## dataset$quality: 6
## [1] 5.3
## -------------------------------------------------------- 
## dataset$quality: 7
## [1] 3.65
## -------------------------------------------------------- 
## dataset$quality: 8
## [1] 4.3
## -------------------------------------------------------- 
## dataset$quality: 9
## [1] 2.2

Residual sugar varies erratically relative to quality, so it may have little impact on the quality of wine.

## [1] "median of chlorides by quality:"
## dataset$quality: 3
## [1] 0.041
## -------------------------------------------------------- 
## dataset$quality: 4
## [1] 0.046
## -------------------------------------------------------- 
## dataset$quality: 5
## [1] 0.047
## -------------------------------------------------------- 
## dataset$quality: 6
## [1] 0.043
## -------------------------------------------------------- 
## dataset$quality: 7
## [1] 0.037
## -------------------------------------------------------- 
## dataset$quality: 8
## [1] 0.036
## -------------------------------------------------------- 
## dataset$quality: 9
## [1] 0.031

There is a modest relationship: as chlorides decrease, quality increases. The median falls from 0.047 g/l at quality 5 to 0.031 g/l at quality 9.

## [1] "median of free.sulfur.dioxide by quality:"
## dataset$quality: 3
## [1] 33.5
## -------------------------------------------------------- 
## dataset$quality: 4
## [1] 18
## -------------------------------------------------------- 
## dataset$quality: 5
## [1] 35
## -------------------------------------------------------- 
## dataset$quality: 6
## [1] 34
## -------------------------------------------------------- 
## dataset$quality: 7
## [1] 33
## -------------------------------------------------------- 
## dataset$quality: 8
## [1] 35
## -------------------------------------------------------- 
## dataset$quality: 9
## [1] 28

Median free sulfur dioxide dips sharply at quality 4, then stays fairly flat at the higher quality levels.

## [1] "median of total.sulfur.dioxide by quality:"
## dataset$quality: 3
## [1] 159.5
## -------------------------------------------------------- 
## dataset$quality: 4
## [1] 117
## -------------------------------------------------------- 
## dataset$quality: 5
## [1] 151
## -------------------------------------------------------- 
## dataset$quality: 6
## [1] 132
## -------------------------------------------------------- 
## dataset$quality: 7
## [1] 122
## -------------------------------------------------------- 
## dataset$quality: 8
## [1] 122
## -------------------------------------------------------- 
## dataset$quality: 9
## [1] 119

Median total sulfur dioxide drops at quality 4, rebounds at quality 5, then declines and levels off around 120 mg/l.

## [1] "median of density by quality:"
## dataset$quality: 3
## [1] 0.994425
## -------------------------------------------------------- 
## dataset$quality: 4
## [1] 0.9941
## -------------------------------------------------------- 
## dataset$quality: 5
## [1] 0.9953
## -------------------------------------------------------- 
## dataset$quality: 6
## [1] 0.99366
## -------------------------------------------------------- 
## dataset$quality: 7
## [1] 0.99176
## -------------------------------------------------------- 
## dataset$quality: 8
## [1] 0.99164
## -------------------------------------------------------- 
## dataset$quality: 9
## [1] 0.9903

We notice a pattern with density: as density decreases, the overall quality increases.

## [1] "median of pH by quality:"
## dataset$quality: 3
## [1] 3.215
## -------------------------------------------------------- 
## dataset$quality: 4
## [1] 3.16
## -------------------------------------------------------- 
## dataset$quality: 5
## [1] 3.16
## -------------------------------------------------------- 
## dataset$quality: 6
## [1] 3.18
## -------------------------------------------------------- 
## dataset$quality: 7
## [1] 3.2
## -------------------------------------------------------- 
## dataset$quality: 8
## [1] 3.23
## -------------------------------------------------------- 
## dataset$quality: 9
## [1] 3.28

We notice a slight trend with pH: from quality 5 upward, higher quality ratings go with a higher median pH.

## [1] "median of sulphates by quality:"
## dataset$quality: 3
## [1] 0.44
## -------------------------------------------------------- 
## dataset$quality: 4
## [1] 0.47
## -------------------------------------------------------- 
## dataset$quality: 5
## [1] 0.47
## -------------------------------------------------------- 
## dataset$quality: 6
## [1] 0.48
## -------------------------------------------------------- 
## dataset$quality: 7
## [1] 0.48
## -------------------------------------------------------- 
## dataset$quality: 8
## [1] 0.46
## -------------------------------------------------------- 
## dataset$quality: 9
## [1] 0.46

Sulphates are fairly stable across the board relative to the quality.

## [1] "median of alcohol by quality:"
## dataset$quality: 3
## [1] 10.45
## -------------------------------------------------------- 
## dataset$quality: 4
## [1] 10.1
## -------------------------------------------------------- 
## dataset$quality: 5
## [1] 9.5
## -------------------------------------------------------- 
## dataset$quality: 6
## [1] 10.5
## -------------------------------------------------------- 
## dataset$quality: 7
## [1] 11.4
## -------------------------------------------------------- 
## dataset$quality: 8
## [1] 12
## -------------------------------------------------------- 
## dataset$quality: 9
## [1] 12.5

Even though there is a dip at quality rating 5, higher alcohol content is associated with a higher wine quality rating.

Acidity and pH

It looks like fixed.acidity has a negative relationship with pH: as fixed acidity declines, pH increases. Of the three acidity variables, fixed acidity has the strongest relationship with pH.

When we compare volatile acidity to pH, the observations are concentrated between 0.1 and 0.45 g/l and between pH 2.9 and 3.5.

With citric acid, as with volatile acidity, the observations are concentrated in one region: 0.1 to 0.6 g/l and pH 2.8 to 3.5.

## 
##  Pearson's product-moment correlation
## 
## data:  pH and log10(fixed.acidity)
## t = -33.783, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4572280 -0.4117972
## sample estimates:
##        cor 
## -0.4347892
## 
##  Pearson's product-moment correlation
## 
## data:  pH and log10(volatile.acidity)
## t = -3.7719, df = 4896, p-value = 0.0001639
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08171127 -0.02586052
## sample estimates:
##         cor 
## -0.05382799
## 
##  Pearson's product-moment correlation
## 
## data:  pH and citric.acid
## t = -11.614, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1908793 -0.1363671
## sample estimates:
##        cor 
## -0.1637482
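The correlation tests above follow the output format of base R's `cor.test()`. A minimal sketch of the first test is shown below. The data frame `wine` is a hypothetical stand-in holding only the first ten rows printed by `str()` earlier, so its result will not match the full-data estimate.

```r
# Hypothetical stand-in: the first ten pH and fixed.acidity values from str().
wine <- data.frame(
  pH = c(3.00, 3.30, 3.26, 3.19, 3.19, 3.26, 3.18, 3.00, 3.30, 3.22),
  fixed.acidity = c(7.0, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7.0, 6.3, 8.1)
)

# Pearson correlation between pH and log10-transformed fixed acidity.
# with() lets the printed "data:" line use the bare column names.
res <- with(wine, cor.test(pH, log10(fixed.acidity)))
print(res)

# The individual pieces of the result are available by name.
res$estimate  # sample correlation
res$conf.int  # 95 percent confidence interval
res$p.value   # p-value for the null hypothesis of zero correlation
```

The log10 transform mirrors the tests reported above, where it is applied to fixed and volatile acidity but not to citric acid, which has a minimum of 0 and cannot be log-transformed directly.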

Density, Alcohol, Chloride

The density and chlorides observations are concentrated in one area.

Alcohol percentage decreases as density increases marginally.

Alcohol percentage decreases as chlorides increase marginally.

Sulphates, Sulfur Dioxide, Chlorides

There is no discernible pattern: as the sulphate level changes, free sulfur dioxide remains relatively stable.

There is no discernible pattern: as the sulphate level changes, total sulfur dioxide remains relatively stable.

There is no discernible pattern: as the sulphate level changes, chlorides remain relatively stable.

##                             [,1]
## X                     0.04199914
## fixed.acidity        -0.08448545
## volatile.acidity     -0.19656168
## citric.acid           0.01833273
## residual.sugar       -0.08206979
## chlorides            -0.31448848
## free.sulfur.dioxide   0.02371338
## total.sulfur.dioxide -0.19668029
## density              -0.34835102
## pH                    0.10936208
## sulphates             0.03331897
## alcohol               0.44036918
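A one-column table like the one above is what `cor()` returns when it is given a data frame of predictors and a single outcome vector. The sketch below shows that call on a hypothetical data frame `wine`. The values are illustrative only, and whether the report used this exact call or any transformation is an assumption.

```r
# Hypothetical mini dataset; the values are illustrative, not taken from the data.
wine <- data.frame(
  alcohol = c(9.4, 9.8, 10.2, 10.6, 11.4, 11.0, 12.2),
  density = c(0.9970, 0.9958, 0.9945, 0.9950, 0.9920, 0.9925, 0.9910),
  chlorides = c(0.050, 0.047, 0.044, 0.045, 0.038, 0.036, 0.033),
  quality = c(5, 5, 6, 6, 7, 7, 8)
)

# Every column except quality is treated as a predictor.
predictors <- wine[, setdiff(names(wine), "quality")]

# cor() with a data frame and a vector returns a one-column matrix:
# one Pearson correlation with quality per predictor.
r <- cor(predictors, wine$quality)
print(r)

# Sort by absolute value so the strongest relationships come first.
r_sorted <- r[order(abs(r[, 1]), decreasing = TRUE), , drop = FALSE]
print(r_sorted)
```

Sorting by absolute value makes it easier to pick out the variables worth carrying into the multivariate plots.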

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Overall, white wine quality has its strongest relationships with alcohol (0.44), density (-0.35), chlorides (-0.31), total sulfur dioxide (-0.20), and volatile acidity (-0.20).

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

It was interesting to see a moderate negative relationship between fixed acidity and pH (r = -0.43). Perhaps this is because fixed acids are present at a higher concentration than the other acids.

What was the strongest relationship you found?

The strongest relationship we found with quality is alcohol percentage (r = 0.44).

Multivariate Plots Section

As volatile acidity decreases and alcohol increases, the quality of wine increases.

This plot is scattered: high quality wines appear at many different pH values and across different ranges of fixed acidity.

As chlorides decrease and alcohol increases, the quality of wine increases.

Total sulfur dioxide seems to have little to no effect on quality, whereas higher quality is clearly associated with a higher alcohol percentage.

## 
##  Pearson's product-moment correlation
## 
## data:  alcohol and sulphates
## t = -1.22, df = 4896, p-value = 0.2225
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.04541705  0.01057885
## sample estimates:
##         cor 
## -0.01743277

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The main part of our investigation was to examine the features that had the highest correlation with quality.

In our plot we see how alcohol and volatile acidity connect with quality ratings. Higher alcohol and lower volatile acidity tend to produce better quality wine.


Final Plots and Summary

Plot One

Description One

This plot shows the distribution of volatile acidity across the range of quality ratings. Each boxplot shows the minimum, first quartile, median, third quartile, and maximum value. The dots show how the wines are distributed within each category: they are concentrated around the middle quality ratings and are sparser at the lowest and highest ratings. The red line running across the boxplots helps visualize the trend between volatile acidity and quality rating. As volatile acidity declines, the quality rating increases.

Plot Two

Description Two

We can see alcohol contributing to the quality of wine. Even though the box plot suggests that alcohol declines across quality ratings 3 to 5, there is a strong increase from rating 5 onward.

Plot Three

Description Three

When we compare volatile acidity and alcohol, we see that these two variables, both of which are correlated with quality, are jointly related to the quality rating. On the plot, the lower quality wines have high volatile acidity and low alcohol. As we move to the right of the graph, lower volatile acidity and higher alcohol are associated with higher wine quality.

Reflection

This project was a great opportunity to apply some of the R skills we learned earlier in the lessons. It was a great way to explore the different functions and plotting features that R offers.

One of the challenges was selecting a meaningful variable to dig deeper into and build the analysis around. Selecting supporting variables that are highly correlated with the main variable was also challenging, and insightful at the same time.

Because R is such a powerful tool, it made exploring the data much more effective. It helps us see trends and allows us to draw meaningful insights.

With this project, we were able to identify some of the trends in the data. As future work, we could build prediction models and see how well these trends predict wine quality from the chemical variables.
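As one possible starting point for that future work, a linear model of quality on the two variables highlighted in Plot Three could look like the sketch below. The data frame `wine` and its values are hypothetical, and a linear model is only an assumption here: quality is an ordered rating, so other model types may suit it better.

```r
# Hypothetical mini dataset; the values are illustrative, not taken from the data.
wine <- data.frame(
  quality = c(5, 5, 6, 6, 6, 7, 7, 8),
  alcohol = c(9.4, 9.8, 10.2, 10.6, 10.0, 11.4, 11.0, 12.2),
  volatile.acidity = c(0.32, 0.28, 0.26, 0.25, 0.30, 0.24, 0.27, 0.22)
)

# Ordinary least squares: quality as a linear function of alcohol and
# volatile acidity.
fit <- lm(quality ~ alcohol + volatile.acidity, data = wine)
print(coef(fit))

# Predicted quality for a new, hypothetical wine.
new_wine <- data.frame(alcohol = 11, volatile.acidity = 0.25)
pred <- predict(fit, newdata = new_wine)
print(pred)
```

On the full dataset, `summary(fit)` would report how much of the variation in quality these two variables explain.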

